ABSTRACT

Expectations about the correlation of cue phrases, the duration of
unfilled pauses and the structuring of spoken discourse are framed 
in light of Grosz and Sidner's theory of discourse and are tested for 
a directions-giving dialogue. The results suggest that cue phrase 
and discourse structuring tasks may align, and show a correlation for 
pause length and some of the modifications that speakers can make 
to discourse structure. 

1. Introduction 

Because an utterance is best understood in the context in which 
it is delivered, its interpreters must be able to identify the rel- 
evant context and recognize when it is altered, supplanted or 
revived. The transient nature of speech makes this task diffi- 
cult. However, the difficulty is alleviated by the abundance of 
lexical and prosodic cues available to a speaker for communi- 
cating the location and type of contextual change. The inves- 
tigation of the interaction between these cues presupposes a 
theory of contextual change in discourse. The theory relating 
attention, intentions and discourse structure [^] is particularly 
useful because it provides a computational account of the cur- 
rent context and the mechanisms of contextual change. This 
account frames the questions I investigate about the correlation
between lexical and prosodic cues. In particular,
the theory motivates the selection of the cue phrase^ — a 
word or phrase whose relevance is to structural or rhetorical 
relations, rather than topic — and the unfilled pause (silent 
pause) as significant indicators of discourse structure. 

2. The tripartite nature of discourse 

To explain the organization of a discourse into topics and 
subtopics, Grosz and Sidner postulate three interrelated com- 
ponents of discourse — a linguistic structure, an intentional 
structure and an attentional state [3]. In the linguistic structure,
the linear sequence of utterances becomes hierarchical:
utterances aggregate into discourse segments, and the discourse
segments are organized hierarchically according to the
relations among the purposes or discourse intentions^ that
each satisfies. 



' Discourse intentions are those goals or intentions intended to be recog- 
nized by each participant as the purpose to which the current segment of talk 
is devoted. 



The relations among discourse intentions are captured in the 
intentional structure. It is this organization that is mirrored by
the linguistic structure of utterances. However, while the lin- 
guistic structure organizes the verbatim content of discourse 
segments, the intentional structure contains only the intentions 
that underlie each segment. The supposition of an intentional 
structure explains how discourse coherence is preserved in 
the absence of a complete history of the discourse. Rather, 
discourse participants summarize the verbatim contents of a 
discourse segment by the discourse intention it satisfies. The 
contents of a discourse segment are collapsed into an intention, 
and intentions themselves may be collapsed into intentions of 
larger scope. 

The discourse intention of greatest scope is the Discourse 
Purpose (DP), the reason for initiating a discourse. Within 
this, discourse segments are introduced to fulfill a particular 
Discourse Segment Purpose (DSP) and thereby contribute to 
the satisfaction of the overall DP. A segment terminates when 
its DSP is satisfied. Similarly, a discourse terminates when 
the DP that initiated it is satisfied. 

The attentional state is the third component of the tripartite 
theory. It models the foci of attention that exist during the 
construction of intentional structures. The global focus of 
attention encompasses those entities relevant to the discourse 
segment currently under construction, while the local focus 
(also called the center^) is the currently most salient entity 
in the discourse segment. The local focus may change from 
utterance to utterance, while the global focus (i.e., current 
context) changes only from segment to segment. 

The linguistic, intentional and attentional components are in- 
terrelated. In particular, the attentional state describes the pro- 
cessing of the discourse segment which has been introduced 
to satisfy the current discourse intention. The functional in- 
terrelation is expressed temporally in spoken discourse — the 
linguistic, intentional and attentional components devoted to 
one DSP co-occur. Therefore, a change in one component 
reflects or induces changes in the rest. For example, changes 
ascribed to the attentional state indicate changes in the inten- 
tional structure, and moreover, are recognized via qualitative 
changes in the linguistic structure. It is because of their
interdependence and synchrony that I can postulate the hypothesis
that co-occurring linguistic and attentional phenomena in spoken
discourse (cue phrases, pauses, and discourse structure
and processing) are linked.

The parts of the theory most directly relevant to my investigation
are the constructs that model the attentional state.
These are the focus space and the focus space stack. The 
focus space is the computational representation of process- 
ing in the current context, that is, for the discourse segment 
currently under construction. Within a focus space dwell rep- 
resentations of the entities evoked during the construction of 
the segment — propositions, relations, objects in the world 
and the DSP of the current discourse segment. 

A focus space lives on a pushdown stack called the focus space 
stack. The progression of focus in a discourse is modeled via 
the basic stack operations — pushes and pops — applied to the 
stack elements. For example, closure of a discourse segment 
is modeled by popping its associated focus space from the 
stack; introduction of a segment is modeled by pushing its 
associated focus space onto the stack; retention of the current 
discourse segment is modeled by leaving its focus space on 
the stack in order to add or modify its elements. 
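The stack model described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the theory: the class names `FocusSpace` and `FocusSpaceStack`, the field names, and the example segment purposes are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class FocusSpace:
    """Hypothetical container for what is salient while one segment is built."""
    dsp: str                                      # Discourse Segment Purpose
    entities: set = field(default_factory=set)    # propositions, relations, objects


class FocusSpaceStack:
    """A plain pushdown stack of focus spaces (illustrative only)."""

    def __init__(self):
        self._spaces = []

    def push(self, space):
        # Introduction of a segment: its focus space goes on top.
        self._spaces.append(space)

    def pop(self):
        # Closure of a segment: its focus space leaves the stack.
        return self._spaces.pop()

    def top(self):
        # Retention: the caller modifies the space returned here in place.
        return self._spaces[-1]

    def depth(self):
        return len(self._spaces)


stack = FocusSpaceStack()
stack.push(FocusSpace("give walking directions"))   # assumed example purposes
stack.push(FocusSpace("describe the first turn"))
stack.top().entities.add("wall")                    # retention: add an entity
closed = stack.pop()                                # closure of the inner segment
```

After the final pop, only the outer focus space remains on the stack, which corresponds to a discourse that is still in progress.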

The contents of a focus space whose DSP is satisfied are 
accrued in the longer lasting intentional structure. Thus, at 
the end of a discourse the focus space stack is empty while 
the intentional structure is fully constructed. 

The focus space model abstracts the processing that all partic- 
ipants must do in order to accurately track and affect the flow 
of discourse. Thus, it treats the emerging discourse struc- 
ture and the changing attentional foci as publicly accessible 
properties of the discourse. However, although the partici- 
pants themselves may act as if they are manipulating public 
structures, the informational and attentional properties of a 
discourse are, in fact, modeled only privately. 

In explaining certain lexical and prosodic features of dis- 
course, it is often useful to return to these private models. For 
a speaker's utterance is conditioned both by the state of
her own model and by her beliefs about those of her interlocutors.
The time-dependent nature of speech emphasizes
the importance of synchronizing private models. Lexical and 
prosodic focusing cues hasten synchronization. In particu- 
lar, they guide the listeners in updating their models (among 
them, the focus space stack) to reflect the attentional changes 
already in effect for the speaker.

For my analysis, the most relevant private model belongs to 
the current speaker, whose discourse intentions guide, for the 
moment, the flow of topic and attention in a discourse and 
whose spoken contributions provide the richest evidence of 
attentional state. If cue phrases and unfilled pause durations 
can be shown to correlate with attentional state (and, by definition,
the intentional and linguistic structure), the attentional
state they reveal belongs to the current speaker, and the attentional
changes they denote are the ones the speaker makes in
her own private model.



3. Main Hypotheses 

The theory of the tripartite nature of discourse frames my hy- 
potheses about the correlation of cue phrases, pause duration 
and discourse structure. The main hypotheses are these: that 
particular unfilled pause durations tend to correlate with par- 
ticular cue phrases and that this correlation is occasioned by 
changes to the attentional state of the discourse participants, 
or, equivalently, by the emerging intentional structure of a 
discourse. 

Cue phrases Changes to the attentional state occur at seg- 
ment boundaries. Cue phrases by definition evince these 
changes — they are utterance- and segment-initial words or 
phrases and they inform on structural or rhetorical relations 
rather than on topic. Thus, for cue phrases, the question is 
not whether they correlate with attentional state, but how. To 
answer this question, we ask, for each cue phrase (e.g., Now,
To begin with, So), whether it signals particular and distinct 
changes to the attentional state. 

Pauses The correlation of unfilled pauses with attentional 
state is less certain because pauses appear at all levels of 
discourse structure. They are found within and between the 
smallest grammatical phrase, the sentence, the utterance, the 
speaking turn and the discourse segment. Their correlation is 
mainly with the cognitive difficulty of producing a phrase or 
utterance [1]. To link this correlation with the task of producing
discourse structure, we must posit a variety of attentional
operations with corresponding variability in cognitive diffi- 
culty. Specifically, we construct the chain of assumptions 
that: 

• More than one attentional operation exists (e.g., initia- 
tion, retention, closure). 

• The different attentional operations are distinguished 
by their effect on the attentional state and by the cog- 
nitive difficulty of their production. 

• The amount of silence preceding an attentional opera- 
tion is correlated with the greater or lesser demands it 
makes on mental processing. 

To link unfilled pause duration to discourse structure we must 
first establish that operations on the attentional state can be 
distinguished sufficiently to explain the different demands that 
each operation makes on discourse processing and which, 
therefore, might be reflected in the duration of segment-initial 
unfilled pauses. 



4. Auxiliary hypotheses 

The linking of pause duration to the processing of discourse 
segments motivates some auxiliary hypotheses that refine notions
about the kinds of mental operations sanctioned by the
focus space model and about the internal structure of a dis- 
course segment. These auxiliary hypotheses are developed in 
this section. 

4.1. Attentional operations 

In the theory of discourse structure, changes to the attentional 
state are modeled as operations on the focus space stack. 
These operations appear reducible to four distinct sequences 
of stack operations that correspond to four distinct effects on 
the attentional state, as follows: 

• One push: Initiate a new focus space.

• No push, no pop: Retain the current focus space.

• One or more pops: Return to a previously initiated
focus space.^

• One or more pops followed by a push: Replace a
previous focus space(s) with a new one.



The arrangement is asymmetrical in that it is possible to pop 
more than one focus space per operation, but to push only one, 
as shown in Table 1.



Operation   Summary
Initiate    One push.
Retain      No push, no pop.
Return      One or more pops.
Replace     One or more pops, followed by a push.

[Columns "Focus space stack before operation" and "Focus space stack after operation": stack diagrams not recoverable.]

Table 1: The effect of the four focusing operations on the
focus spaces (FS) in the pushdown focus space stack.

The decomposition of focus space operations into stack operation
primitives is not merely an attempt to impose a computational
patina on descriptive terms. Rather, it suggests
that operations that differ in kind and number place different
requirements on mental processing for both speaker and hearer, and
therefore might be accompanied by lexical and acoustical phenomena
that also differ.
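The reduction of the four focusing operations to stack primitives can be expressed as a small classifier. This is only a sketch of the decomposition given in Section 4.1; the function name `classify_operation` and its argument names are assumptions made for illustration.

```python
def classify_operation(pops: int, pushes: int) -> str:
    """Name the focusing operation for a given count of stack primitives.

    Follows the asymmetry stated in the text: any number of pops is
    allowed per operation, but at most one push.
    """
    if pops < 0 or pushes not in (0, 1):
        raise ValueError("expected pops >= 0 and pushes in {0, 1}")
    if pops == 0:
        return "Initiate" if pushes == 1 else "Retain"
    return "Replace" if pushes == 1 else "Return"
```

For instance, closing two segments and opening a new one (`pops=2, pushes=1`) is a Replace, while closing two segments without opening one is a Return.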

4.2. Structure of a discourse segment 

To further motivate the particular usefulness of cue phrases 
and unfilled pauses as locators of discourse segment bound- 
aries and markers of attentional state, it is useful to distinguish 
among three phases in the life of a discourse segment (and its 
focus space counterpart) — its initiation, development and 
closure. We make the additional assumptions that each phase
may be marked explicitly or implicitly and by lexical and
acoustical phenomena.^

From inspection of dialogue, it appears that the development 
phase must be instantiated explicitly with lexical contribu- 
tions, while the boundary phases need not be. However, while 
lexical marking of segment boundaries is optional, prosodic 
marking is not. Thus, at initiation of a discourse segment
we find, for example, an expanded pitch range [12] and at its
closure, phrase-final lowering [||| and syllable lengthening [^].

Sometimes, the same structural cue is implicit for one segment 
yet explicit for another. For example, in a Replace operation,
explicitly marked closure of one segment implicitly permits 
the initiation of the next. Conversely, an explicitly marked 
Initiation of the current segment testifies implicitly to the 
closure of the previous one. 

Boundary phenomena are of special relevance toward retriev- 
ing discourse structure from a multiplicity of lexical and 
acoustic clues. The distinction between explicit and implicit 
correlates for each phase of segment construction admits four 
classes of segment boundary phenomena — phenomena that 
are: explicit and segment-initial; implicit and segment-initial; 
explicit and segment-final; and implicit and segment-final. An 
investigation of how cue phrases and unfilled pauses reflect 
discourse structure and the state of its processing is thus an 
investigation of the explicit and segment-initial evidence of 
focus space initiation. 

The selection of segment-initial phenomena in no way implies 
that segment-final phenomena are any less crucial to the com- 
munication and recognition of discourse segment boundaries. 
Nor does the selection of cue phrases and unfilled pauses 
minimize the contributions of other lexical and prosodic phe- 
nomena. Rather, these selections are motivated by features 
of the focus space model that both cue phrases and unfilled 
pauses might specially illuminate, and conversely, by features 
of the model that might specially illuminate the discourse 
function of cue phrases and unfilled pauses. These features 
are described in the following two sections. 



^When at least one focus space remains on the stack, the discourse con- 
tinues. When none remain, however, the discourse is ended. 



' Gestural correlates of discourse structure and processing are outside the
scope of this investigation. 



5. Cue phrases, discourse markers and 
attentional state 

Cue phrases are those words or phrases which introduce an
utterance (e.g., To begin with, First of all, Now, But)
and coordinate the flow of conversation and focus rather than
contribute directly to the topic at hand. They provide broad, 
topic independent indications of how the speaker intends to 
relate the current utterance to those preceding it, thus locating 
the utterance in the discourse structure. The information they 
convey is attentional, intentional or both. 

The study of cue phrases and their correlation with discourse 
structure and focus of attention is most extensive for the discourse
marker^ld^ subcategory. Schiffrin's work, in particular,
is the basis for my predictions about the structural effects of
cue phrases on the focus space model. 

5.1. Discourse markers 

Discourse markers are generally single word phrases, such as 
Well, Now, Then or So, whose pragmatic role in a discourse 
usually follows from their syntactic and semantic role in a 
grammatical phrase. That is, if a word in semantic guise re- 
lates propositions in a grammatical phrase, it marks in its 
pragmatic guise the same or similar relation between utter- 
ances in a discourse. For example [[ic[]: 

• And, as a discourse marker, indicates connectedness, 
conveying the speaker's view that the utterance it heads 
is connected to the prior discourse. The connection 
may be to the immediately previous utterance or to the 
speaker's prior [interrupted] turn. 

• But also marks connectedness, but connects utterances
in a contrast relation. The contrast may be structural
(resumption after a digression or interruption) or
rhetorical. Like well, it introduces unexpected or undesired
material, but in a less cooperative manner.

• I mean precedes a repair or modification of the 
speaker's own contribution or highlights something to 
which the speaker believes the hearer should attend. 

• So may precede a presentation of a result, and indicates 
transitions to a higher level, in contrast to "because" 
which indicates progressive embedding. 

• Now emphasizes what the speaker is about to do, and 
is often used to introduce evaluations. 

• Well is often used in response, when the possibilities 
offered by the previous speaker are inadequate. It indi- 
cates an awareness of conversational expectations but 
also heralds a violation of the previous speaker's ex- 
pectations. 

• You know indicates an appeal to shared knowledge and 
mutual beliefs. 



5.2. Discourse markers reinterpreted 

Some of the observations about the conversational role of dis- 
course markers invoke structural effects (embedding, return 
to a higher level) although without detailing the structure in 
question. A more unified and computationally driven account 
might be posed in terms of operations on the focus space stack, 
as follows: 

• And (connectedness): Retain, Return. 

• But (contrast): Retain, Replace or Return. 

• I mean (modification or repair): Initiate, Retain. 

• So (presentation of a result): Return, Replace. 

• Because (progressive embedding): Initiate. 

• Now (what the speaker is about to do): Replace. 

• Well (inadequate options): Replace. 

• You know (appeal to shared knowledge): Retain, or 
Initiate when it precedes an aside. 

In addition, there are the cue phrases that highlight structural
or propositional ordinality. The first use of such a phrase
(e.g., To begin with, In the first place) is likely to denote a
focus space Initiation while subsequent uses (e.g., Secondly,
Finally) denote a focus space Replacement.

These formulations are not deterministic. They illustrate, 
however, the hypothesis that certain of the discourse markers 
are more likely to betoken certain focusing operations. Under 
what conditions might such correspondences exist? Clearly, 
features of the context in which a cue phrase is used might 
constrain its effect on focusing, and so explain how conver- 
sants are able to track focus from cues that, by themselves, 
are ambiguous. 
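The non-deterministic correspondences listed in Section 5.2 amount to a lookup from a discourse marker to a set of candidate focusing operations, which other evidence may then narrow. The sketch below encodes only the mappings listed above; the names `CANDIDATE_OPERATIONS` and `candidates`, and the idea of passing corroborating evidence as a set, are assumptions made for illustration.

```python
# Candidate focusing operations per discourse marker, as listed in Section 5.2.
CANDIDATE_OPERATIONS = {
    "and": {"Retain", "Return"},
    "but": {"Retain", "Replace", "Return"},
    "i mean": {"Initiate", "Retain"},
    "so": {"Return", "Replace"},
    "because": {"Initiate"},
    "now": {"Replace"},
    "well": {"Replace"},
    "you know": {"Retain", "Initiate"},
}


def candidates(cue, corroborated=None):
    """Return the operations a cue may signal.

    `corroborated` is a hypothetical set of operations that other lexical
    or prosodic evidence allows; when given, it filters the candidates.
    """
    key = cue.strip().rstrip(",").lower()
    ops = set(CANDIDATE_OPERATIONS.get(key, set()))
    if corroborated is not None:
        ops &= set(corroborated)
    return ops
```

A cue such as So is ambiguous on its own (Return or Replace); intersecting with independently supported operations selects the probable reading from the possible ones.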

Thus, to select the probable from the possible, corroboration 
from other quarters is required. Lexical corroboration may 
be semantic, from domain specific evidence of topic change 
or continuation. Or it may be syntactic, from those syntac- 
tic distributions that tend not to cross segment boundaries 
(tense, aspect and the scope of referring expressions [^]).^
Alternatively, prosodic features are likely to better identify the
current use of a cue phrase from those that are possible.

6. Unfilled pauses and attentional state 

The most useful prosodic correlates of discourse segmen- 
tation occur at segment boundaries and indicate either the 
opening of a new segment, closure of the old or both. For 
example, a phrase-final continuation rise forestalls segment 
closure while phrase-final lowering confirms it [^]. And expanded
pitch range tends to mark the introduction of new

^For example, Walker and Whittaker observe that deictic pronominal reference
may cross segment boundaries, while nondeictic pronominal reference
does so only rarely [13].



topics, while reduced pitch range marks subtopics and paren- 
theticals. Similarly, voice quality changes, e.g., from normal 
to creaky voice, may accompany attentional and intentional 
changes. 

Filled pauses (e.g., Um, uh) and unfilled pauses appear at 
segment boundaries but are also found within a discourse 
segment and in the smaller groupings it contains. In contrast 
to the propositional and attentional accounts of intonational 
cues [^], accounts of pausing invoke the demands of cognition
and pragmatics. For example, the duration of unfilled pauses
has been observed to correlate with the cognitive difficulty
involved in producing an utterance [1], while filled pauses may
function as a floor-holding device [7] or, perhaps, correlate
with the speaker's emotional response to topic [1].



As corroborators of attentional interpretations of cue phrases,
filled pauses are less useful than unfilled pauses because they
overlap with cue phrases in both form (partially lexicalized)
and function. A more independent measure is provided by
unfilled pauses, which are not lexicalized and therefore carry
neither lexical nor intonational propositions. Rather, as correlates
of cognitive processing, they may also correlate with
the specific differences among stack operations, which, after
all, are cognitive operations, albeit idealized.

The selection of unfilled pause duration as a possible marker
of attention and segmentation also has a practical advantage:
unfilled pauses are easy to locate instrumentally and easy to check
perceptually. Moreover, their measurement is unambiguous instrumentally
and requires less from perception than, for example,
intonational prosodic cues. For, while intonational features
are categorical according to their type (combinations of the
L, H and * tokens [^]) and the structure to which they apply
(word, intermediate phrase, intonational phrase), pause duration
is ordinal and is measured on the same continuous linear
scale for all levels of linguistic and intonational structures.

7. Questions and predictions 

My investigation is inspired by the theory relating attention,
intentions and discourse structure [^]. To the more specific
observations linking cue phrases to attentional state [|^ |l^
and the duration of unfilled pauses to increased cognitive
difficulty [1], I add the assumption of four fundamental focusing
operations. Together, they motivate my hypotheses
that: 

(1) Specific cue phrases betoken specific focusing operations.

(2) Differences in the cognitive difficulty of the focusing 
operations are reflected in the duration of the pauses 
that precede them. 



From these hypotheses come the specific questions that guide 
the research: 

• Is there a correlation between a focusing operation
and the duration of the pause that precedes it?

• Are cue phrases correlated with focusing operations — 
how often and under what circumstances? 

• What is the relation of pausing and cue phrases — do 
they substitute for each other, complement each other or
play different roles such that one is required or allowed 
where the other is not? 

• Is there a unique minimum cognitive cost for each stack 
primitive (Push, Pop) of which focusing operations are 
composed, and that would therefore explain differences 
in segment-initial pause duration? 

In addition, the hypotheses raise questions not immediately 
answerable: 

• If there are indeed patterns of usage, do they differ pre- 
dictably for different discourse features, for example, 
by format (monologue or dialogue) or according to the 
planning effort (prepared or extemporaneous) required 
in formulating each utterance? 

• If on the other hand, correlations are partial at best, can 
other lexical or prosodic features provide the missing 
correlates? 

Research into these questions is not without its biases. Thus, 
I expected to find in my discourse samples the following 
correlations: 

• Unfilled pause duration and focusing operation are 
correlated. 

• Cue phrases are correlated with focusing operations. 
(The particular predictions are discussed previously in 
Section 5.2.) 

• Cue phrase type and unfilled pause duration are cor- 
related as well. 

The hypothesized correlation of unfilled pause duration with 
focusing operations is based on assumptions about variations 
in complexity among the operations, such that longer pauses 
will accompany more complex operations. Complexity is con- 
jectured to correlate with kind and number. That is, it varies 
according to whether the operation decomposes into pops, a 
push or both and it increases with the number of segments
opened or closed in one operation. 

This produces the particular predictions that: 

• Retentions will be preceded by pauses of the small- 
est duration because they induce neither a push nor 
pop and therefore are the least costly of the focusing 
operations. 



• Pause duration is positively correlated with the number 
of segments affected in one focusing operation. That 
is, the more segments opened or closed, the longer the 
preceding pause. 

• Pops are more costly than pushes. This follows from 
an assumption that adding information (a push) builds 
on what is currently established and accessible, while 
removing information (one or more pops) makes the 
production of subsequent utterances more difficult. 
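The three predictions above imply an ordering of the focusing operations by expected pause length. One way to make that ordering concrete is an additive cost over stack primitives. The sketch below is an assumption layered on the text: the linear form, the function name `predicted_cost`, and the weights `pop_cost=2.0` and `push_cost=1.0` are hypothetical, chosen only so that pops cost more than pushes.

```python
# Minimal (pops, pushes) decomposition of each focusing operation.
OPERATIONS = {
    "Retain": (0, 0),
    "Initiate": (0, 1),
    "Return": (1, 0),
    "Replace": (1, 1),
}


def predicted_cost(pops, pushes, pop_cost=2.0, push_cost=1.0):
    """Hypothetical additive cost; the weights are illustrative, not measured."""
    if pops < 0 or pushes not in (0, 1):
        raise ValueError("expected pops >= 0 and pushes in {0, 1}")
    return pops * pop_cost + pushes * push_cost


# Operations ordered from shortest to longest predicted preceding pause.
ranking = sorted(OPERATIONS, key=lambda name: predicted_cost(*OPERATIONS[name]))
```

Under these assumed weights, Retain ranks lowest, and each additional segment closed in one operation raises the predicted cost, matching the direction (though not any magnitude) of the stated predictions.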

8. Data 

I analyzed two discourse samples — three minutes of a di- 
rections discourse and seven minutes of a manager-employee 
project meeting. The segmentation of the second proved dif- 
ficult and is still in progress, so I report results only for the 
first. 

In the directions discourse, Speaker B provides Speaker A
with walking directions to a location on the M.I.T. campus. 
The discourse takes the form of an expert-client dialogue. Al- 
though Speaker A initiates the dialogue, most of the discourse 
segments and their intentions are introduced by Speaker B, 
the expert.^

9. Methods 

The search for correlations among cue phrases, unfilled pauses 
and discourse structure generated three data collection tasks: 

• Identification of cue phrases; 

• Identification and measurement of unfilled pauses; 

• Segmentation of the discourse via the identification of 
the focusing operations that effected the segmentation. 

9.1. Cue phrase identification 

The main challenge of cue phrase identification lay in distin- 
guishing cue from non-cue uses of a phrase. Usually, cue uses 
are utterance- or segment-initial, while non-cue uses are not. 
However, this is not a reliable criterion for the connectives
And and But, which may head an utterance or phrase as either
a cue phrase or a syntactic conjunctive. In cases where
the usage was unclear, I decided against the pragmatic usage 
if the phrase in question provided syntactic coordination of 
two semantically related propositions. If even this judgment 
proved difficult, I applied the intonational criteria that distinguished
cue and non-cue uses of Now [Q|. Thus, if the cue
phrase candidate was deaccented, or accented with L* tones 
or uttered as a complete intonational phrase, it was classified 
as a cue phrase. 
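The identification criteria above form a short decision cascade: position first, then the syntactic coordination test for connectives, then intonation as a tie-breaker. The sketch below is a simplified reading of that procedure, not the procedure itself: the `Candidate` fields are hypothetical annotations, and the coordination test is applied to every And or But rather than only to unclear cases.

```python
from dataclasses import dataclass
from typing import Optional

CONNECTIVES = {"and", "but"}


@dataclass
class Candidate:
    """Hypothetical hand annotations for one cue phrase candidate."""
    phrase: str
    initial: bool                         # utterance- or segment-initial
    coordinates: Optional[bool] = None    # None: the judgment proved difficult
    deaccented: bool = False
    l_star_accent: bool = False
    own_intonational_phrase: bool = False


def is_cue_phrase(c: Candidate) -> bool:
    if not c.initial:
        return False                      # non-cue uses are usually non-initial
    if c.phrase.lower() not in CONNECTIVES:
        return True                       # position suffices for other phrases
    if c.coordinates is not None:
        return not c.coordinates          # syntactic coordination: not a cue
    # Fall back on the intonational criteria.
    return c.deaccented or c.l_star_accent or c.own_intonational_phrase
```

For example, an utterance-initial But whose coordination status is undecided but which carries an L* accent would be classified as a cue phrase.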



9.2. Pause location and measurement 

Pauses were identified by ear and corroborated and measured 
using the waveform and the energy track displays of two 
signal processing programs.^ The locations of all unfilled
pauses were recorded, as were their durations, rounded to the 
nearest one tenth of a second. 

In general, the procedure was straightforward. The only con- 
fusion was presented by the silence between the closure and 
release phase of plosives. This silence was not counted as a 
genuine unfilled pause. 
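The recording rule described above (discard plosive closure silences, round the rest to a tenth of a second) can be sketched as follows. The input format is an assumption: each silence is a hypothetical `(start, end, is_plosive_closure)` tuple in seconds, with the plosive flag set by hand, since the text does not describe an automatic test for it.

```python
def unfilled_pauses(silences):
    """Return (start, duration) pairs for genuine unfilled pauses.

    Durations are rounded to the nearest tenth of a second; silences
    flagged as plosive closures are skipped.
    """
    pauses = []
    for start, end, is_plosive_closure in silences:
        if is_plosive_closure:
            continue
        pauses.append((start, round(end - start, 1)))
    return pauses


# Invented example values, not data from the recording.
silences = [(0.5, 0.94, False), (1.5, 2.76, False), (3.0, 3.05, True)]
pauses = unfilled_pauses(silences)
```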

9.3. Discourse segmentation 

An accurate discourse segmentation falls out of an accurate 
classification of the focusing operations by which the seg- 
ments have been constructed. The tasks are interrelated and 
both are difficult. Therefore, in this section I will discuss in 
detail the task, its difficulties and the classification criteria I 
developed to enhance the accuracy of my judgments. 

The task The segmentation of a completed discourse is 
equivalently the task of recapturing the attentional state that 
accompanied each successive utterance. Attentional cues are 
especially important because topical relations do not always 
predict discourse structure. The points at which discourse 
structure diverges from the organization of information in the 
domain may be precisely the points at which attentional cues 
are most appropriate. 

Segmentation of a completed discourse is most straightfor- 
ward for expository text. In such discourse, domain and 
attentional hierarchies often coincide — the relations among 
segments and of each segment to the overall Discourse Pur- 
pose are clear. In spoken and impromptu discourse, however, 
the alignment of DSPs is not always so felicitous. Even in the 
task-oriented directions discourse, the relations among steps 
in the task did not conclusively determine the relations of the 
discourse segments in which these steps were described. 

The particular segmentation difficulties presented by my sam- 
ple(s) led to the development of explicit criteria for isolating 
the corroborating features of attentional operations and dis- 
course structure. The criteria help clarify confusion from two 
sources — the distinction between attentional and domain hi- 
erarchies and the interpretation of underspecified lexical and 
prosodic attentional cues. 

Separating the attentional from the topical. In prepared 
discourse (written or spoken) the intentional structure is 
tightly coupled to the Discourse Purpose. In contrast, im- 
promptu discourse exhibits a looser coupling, owing to its 



'The conversation occurred in a face-to-face encounter and was recorded 
on a hand-held cassette recorder.



^SPIRE, written for the LISP machine by Victor Zue's group at M.I.T. 
and dspB (digital signal processing workBench) written for the DECstation 
by Dan Ellis at the M.I.T. Media Laboratory. 



real-time and situated nature. In such discourse, the main- 
tenance of coherence requires the real-time management of 
cognitive resources upon which competing demands may be 
made. As a consequence, influences outside the ostensible DP 
must be managed in support of continuing the conversation at 
all. Because DSPs that are ostensibly outside the current DP 
can become temporarily relevant, provision must be made for 
their principled incorporation into the attentional state and in 
the linguistic and intentional structures. 

This is accomplished via attentional constructions that are 
more likely to occur in spoken discourse, for example, flashbacks,
digressions and interruptions [^]. Their relation to the
discourse in which they occur illustrates the difficulty of seg- 
menting in hindsight a discourse whose DSPs may satisfy 
multiple DPs. This recommends against reliance on domain 
knowledge, since one discourse may invoke more than one 
domain. 

Therefore, to locate segment boundaries, I use criteria that 
emphasize focusing operations independent of the ostensible 
DP. For example, although the succession of two topically 
unrelated segments might suggest a Replace operation, it is 
treated as an Initiate in the presence of explicit indicators of 
linkage or in the absence of explicit indicators of separation. 
Consequently, successive segments may be linked hierarchi- 
cally in the attentional and linguistic structures despite their 
topical independence. 

For example, in the following section of the directions discourse,
(1) is a topic introduction, (2) a digression and (3)
an elaboration, i.e., a subtopic:

( 1 ) To your left, 

( 2 ) if you have followed these directions faithfully, 
(3) y'know you'll be facing a wall straight ahead of 
you, 

Although (2) is a comment on discourse processing, it
functions neither as a cue phrase nor a synchronization device.
The digression it represents is not topically subordinate to
(1), nor is (3) topically subordinate to (2). However, they
are attentionally subordinate to the utterances they follow, as
indicated by the continuation rises at the end of (1) and (2).
While the semantic and topical differences between successive
utterances argue for segment separation, the acoustical
concomitants argue against. Therefore, the attentional moves
that introduce (2) and (3) contain no pops. Instead, they
are Initiations, producing the following segmentation:

Replace   (1) To your left.

Initiate  (2) if you have followed these directions faithfully.

Initiate  (3) y'know you'll be facing a wall straight ahead of you.



Interpreting underspecified cues. Even when successive
utterances are aligned attentionally and topically, their cue
phrase and prosodic markings may not conclusively reveal
their exact place in discourse structures. The underspecified
nature of cue phrase correspondences to focusing operations
is discussed in Section 5.2. Prosodic marking is similarly
underspecified, and on two counts. First, a particular
intonational feature (e.g., phrase-final lowering, phrase-initial
pitch range expansion) can felicitously indicate more than one
focusing operation; second, the intonation at a phrase
boundary often indicates stack primitives (push, pop, null) more
reliably than the composite focusing operations from which
discourse structure is deduced.

For example, in the directions discourse, the cue phrases So, 
But and And often indicated pops, as did the prosodic changes 
that accompanied them, e.g., expanded pitch range and a shift 
from L* to H* tones. However, these cues did not reveal 
exactly how many segments were popped nor whether a push 
followed the sequence of pops. Thus, it was not always easy 
to distinguish a Return (one or more pops) from a Replace 
(one or more pops, followed by a push). 
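
The ambiguity between a Return and a Replace can be made concrete by writing the four focusing operations as compositions of stack primitives. The sketch below is illustrative only: the function names and the example focus spaces are hypothetical, and the mapping of Initiate to a push and Retain to a null operation is read off the descriptions in this paper rather than taken from any published implementation.

```python
from typing import List

Stack = List[str]  # focus spaces, most recently opened last


def initiate(stack: Stack, space: str) -> Stack:
    """Initiate: push a new focus space onto the stack."""
    return stack + [space]


def retain(stack: Stack) -> Stack:
    """Retain: the null primitive; the stack is left unchanged."""
    return list(stack)


def return_to(stack: Stack, pops: int) -> Stack:
    """Return: one or more pops, with no push afterwards."""
    if not 1 <= pops <= len(stack):
        raise ValueError("a Return needs between 1 and len(stack) pops")
    return stack[:-pops]


def replace(stack: Stack, pops: int, space: str) -> Stack:
    """Replace: one or more pops, followed by a push."""
    return return_to(stack, pops) + [space]


# Hypothetical focus spaces, for illustration only.
stack = ["directions", "route to Building Five", "digression"]
after_return = return_to(stack, 2)
after_replace = replace(stack, 2, "architecture lofts")
```

Both calls begin with the same two pops, so a cue that signals only "pop" cannot by itself tell a listener whether `after_return` or `after_replace` is the intended result.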

Neither domain nor syntactic knowledge was conclusive in
this regard. For example, domain and syntax dictated the
following segmentation:

Return    (4) And you need to turn left and then walk along
              Building Five.

Initiate  (5) And you'll be walking through the architecture
              lofts.

but this contradicts what was specified intonationally:

Return    (4) And you need to turn left and then walk along
              Building Five.

Retain    (5) And you'll be walking through the architecture
              lofts.

(The intonationally driven segmentation, in contradiction to 
the structure of knowledge in the domain, may account for the 
listener's subsequent confusion about the very point made in 
this section of the discourse.) 

Classification criteria. Because semantic clues to attentional
state can be confusing and lexical and prosodic markings in- 
conclusive, it became necessary to standardize the procedure 
and criteria for classifying the focusing operations. An accu- 
rate classification depends on the answers to two questions for 
the phrase undergoing classification: Has a new focus space 
been opened? Has an old focus space been closed? Most 
useful in this regard are the lexical and prosodic phenomena 
within and around the phenomena currently under evaluation 
for their attentional effect. 
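
Read together with the stack definitions of the focusing operations, the two questions amount to a four-way decision. The following is a minimal sketch under that reading; the function and parameter names are hypothetical and do not come from the original coding procedure.

```python
def classify_focusing_operation(opened_new: bool, closed_old: bool) -> str:
    """Map the answers to the two classification questions to an operation.

    opened_new: has a new focus space been opened (a push)?
    closed_old: has an old focus space been closed (one or more pops)?
    """
    if closed_old and opened_new:
        return "Replace"   # pops followed by a push
    if closed_old:
        return "Return"    # pops only
    if opened_new:
        return "Initiate"  # push only
    return "Retain"        # neither: the null primitive
```

In practice the answers themselves are uncertain, which is why the surrounding lexical and prosodic evidence has to be consulted before the function above could be applied.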

What constitutes the current phenomena, and what might
constitute their surroundings? I selected as current the speech
fragment that begins with one of five fragment-initial tokens and
whose terminating boundary is marked by the occurrence of the next
fragment-initial token. These tokens are:

• The unfilled pause; 

• The filled pause; 

• A cue phrase; 

• An acknowledgment form: Ok, Sure, Uh-huh, etc.;

• Or the unmarked case: any other sentence-initial gram- 
matical constituent, e.g., a noun phrase, auxiliary verb, 
complementizer or adverb. 

My demarcation of the relevant surrounding phenomena was 
less bound to structure than to function. For both prior and 
subsequent phenomena, I selected the smallest speech frag- 
ment that could be distinguished by its discourse function, 
i.e., by its attentional, coordination or topical role. I assign 
five classifications: 

• A cue phrase; 

• An acknowledgment or prompt; 

• A segment closure (e.g., Good!);

• A repair; 

• Or the unmarked case — development of the topic. 

The lexical and acoustical features of prior, current and
subsequent speech fragments are treated as corroborating evidence
for the attentional operation associated with the current speech
fragment. Often this evidence indicated a stack primitive
(push or pop) rather than a full-fledged focusing operation.
This is illustrated in Table 2, which catalogues the lexical and
prosodic features exhibited by prior, current and subsequent
speech, and the stack and focusing operations for which each
is considered evidence.

Coding the data. The data relevant to every speech fragment
were coded for later statistical analysis. This translated into
two tasks: identifying the prior, current and subsequent
speech fragments and, for each current fragment, recording:

• The duration of the preceding unfilled pause; 

• The type of fragment-initial constituent, either: 

— A cue phrase; 

— An explicit acknowledgment form (e.g., Ok,
Sure);

— A filled pause; 

— Or any other sentence-initial syntactic form 
whose function is primarily topical, not prag- 
matic. 

• The co-occurring focusing operation. 

• The embedding of the current segment in the linguistic 
structure (number of levels). 



GIVEN:                                     CONCLUDE:

Speech      Feature                        Stack                 Focusing Operation
Evidence                                   Primitive             FOR Current Fragment

Prior       Falling phrase-final           Pop of co-occurring   Replace.
speech      intonation, acknowledgment,    segment(s).
            lexical/semantic closure.

            Phrase-final continuation      Null                  Retain.
            rise.

Current     Pronominalization, reduced     Push or Null for      Initiate, Retain.
speech      pitch range, nonstandard       co-occurring
            phonation, many L* accents     segment.
            (parentheticals), relative
            clause, Now, Y'know,
            ordinal cue phrase.

            Nonpronominalized repetition   Pop of previous       Return, Replace.
            (e.g., segue), expanded        segment(s).
            pitch range, reintroduction
            of normal phonation, So, But.

            Falling phrase-final           Impending Pop of      Retain (but an
            intonation, acknowledgment,    co-occurring          impending Return
            prompt, lexical closure,       segment(s).           or Replace).
            phrase-final creaky voice.

Subsequent  Nonpronominalized repetition   Pop of previous       Return, Replace.
speech      (e.g., a segue), expanded      segment(s).
            pitch range, normal
            phonation, So, But, Now.

Table 2: The lexical and acoustical features that support
classifications of stack primitives and focusing operation(s).
A co-occurring segment denotes the segment containing
the speech (prior, current, subsequent) under examination.
The focusing operations, however, describe the attentional
interpretation that such speech indicates for the current speech
fragment.



• The discourse function of the immediately prior speech
(cue phrase, acknowledgment, closure, filled pause,
repair, topical but none of the above).

• The number of segments opened or closed in the
focusing operation.

• The discourse function of the immediately subsequent
speech (same categories as for prior speech).

• Whether the speaker was initiating or continuing a
speaking turn with the current fragment.

Using this metric, one hundred speech fragments were
identified and their features coded. The coded representation
of the discourse was then analyzed for distributions and
statistical correlations.^ The results are reported in the next
section.
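
The coding scheme described above can be pictured as one record per speech fragment. The sketch below is only an illustration: the class name, field names and category strings are hypothetical, and the example values are invented. The definition of a marked fragment (one that begins with a cue phrase, an acknowledgment form or a filled pause) follows the caption of Table 3.

```python
from dataclasses import dataclass

# Fragment-initial constituents that count as explicit lexical markers.
MARKED_TOKENS = {"cue phrase", "acknowledgment", "filled pause"}


@dataclass
class CodedFragment:
    pause_duration: float      # preceding unfilled pause, in seconds
    initial_token: str         # "cue phrase", "acknowledgment",
                               # "filled pause" or "unmarked"
    focusing_operation: str    # "Initiate", "Retain", "Return", "Replace"
    embedding_depth: int       # levels of embedding in linguistic structure
    segments_affected: int     # segments opened or closed by the operation
    prior_function: str        # discourse function of prior speech
    subsequent_function: str   # discourse function of subsequent speech
    turn_initial: bool         # True if the fragment begins a speaking turn

    @property
    def is_marked(self) -> bool:
        """True if the fragment begins with an explicit lexical marker."""
        return self.initial_token in MARKED_TOKENS


# Invented example record, not taken from the directions dialogue.
example = CodedFragment(0.4, "cue phrase", "Return", 1, 2,
                        "closure", "topical", False)
```

One hundred such records would be enough to recompute the distributions and tests reported in the next section.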

10. Results 

In this section I summarize the raw data, report the results of 
statistical tests and offer an explanation of the findings. 

10.1. Data 

The segmentation of the discourse was reconstructed accord- 
ing to the focusing operations indicated both lexically and 
acoustically. The segmentation described a discourse with 
two top level segments. Within the first, the overall task was 
defined; within the second, it was executed. The task
definition segment was itself composed of two top level segments,
while the execution segment was composed of nine.

The key elements of the coding scheme were, of course, the 
focusing operation, the fragment-initial token and the duration 
of the unfilled pause preceding the fragment. Distributions for 
these categories are catalogued in Table 3.

10.2. Statistical analyses 

The predictions were analyzed via statistical tests on the coded 
representation of the discourse. 

Pause duration and focusing operation. A comparison of
the mean pause duration for each focusing operation showed
a significant difference among the operations (F(3,96) = 7.31,
p < .001). The data in Table 4 point to the Replace operation as
most different from the other three operations in this regard.^
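
The reported F statistic can be checked against the summary values in Table 4. The sketch below recomputes a one-way analysis of variance from group sizes, means and standard deviations alone. It assumes that the tabled standard deviations are sample standard deviations (n - 1 in the denominator); the paper does not say so, and the function and variable names are hypothetical.

```python
# (number of tokens, mean pause in seconds, standard deviation), from Table 4.
TABLE_4 = {
    "Initiate": (23, 0.3217, 0.2173),
    "Retain":   (55, 0.2091, 0.1818),
    "Return":   (11, 0.2545, 0.2505),
    "Replace":  (11, 0.6500, 0.6727),
}


def anova_from_summary(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    total_n = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2
                     for n, mean, _ in groups.values())
    # Within-group sum of squares, recovered from the sample variances.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


f_value, df_between, df_within = anova_from_summary(TABLE_4)
```

Under the stated assumption, the result has 3 and 96 degrees of freedom and an F value of about 7.31, in agreement with the figure reported above. Most of the between-group variance comes from the Replace row.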

Pause duration and number of segments affected in a
focusing operation. Pause duration was positively correlated
with the number of segments opened or closed during one
focusing operation (r = .357, p < .001). This finding might
partially explain the long pauses that appear before a Replace,
since a Replace is the focusing operation most likely to affect
the most focus spaces. By definition, it requires [almost]
everything to be popped from the focus stack before the initiation
(push) of a new focus space.
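
The significance of a correlation coefficient follows from its size and the number of observations. As a rough check, the sketch below converts r to a t statistic with n - 2 degrees of freedom. It assumes that all one hundred coded fragments entered the correlation, which the text implies but does not state; the function name is hypothetical.

```python
import math


def t_for_correlation(r: float, n: int) -> float:
    """t statistic, with n - 2 degrees of freedom, for a Pearson r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


t_value = t_for_correlation(0.357, 100)
```

The resulting value, about 3.78 on 98 degrees of freedom, lies above the two-tailed critical value of roughly 3.4 for the .001 level, which is consistent with the reported p < .001.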



The discourse function classifications and the within-/between-turn
distinctions were recorded to track the features influencing the judgment of
focusing operation, but were not included in any calculations.

^However, the importance of this observation is offset by the small sample 
size and large standard deviation. 
                                    Marked    Unmarked

Focusing      Initiate                13         10
operation     Retain                  18         37
              Return                   6          5
              Replace                  7          4
              ALL                     44         56

                                    Initial   Internal

Fragment-     And                      3          4
initial       But                      2          1
constituent   Now                      2          -
              Oh                       2          -
              So                       3          2
              Well                     2          -
              Y'know                   2          -
              Ordinal cue phrase       1          -
              Acknowledgment           2          7
              Filled Pause             7          4
              Unmarked                19         37
              ALL                     45         54

              Seconds               Initial   Internal

Unfilled      0.0                      5         15
pauses        0.1                      6         11
              0.2                      3         15
              0.3
              0.4                     11          5
              0.5                      1          5
              0.6                      3          4
              0.7                      4          2
              0.8                      1
              0.9                      1
              1.7                      1
              2.0                      1
              ALL                     41         62
              Average (seconds)     0.422      0.224

Table 3: Distributions of fragment-initial constituents,
focusing operations and pause durations. Separate counts
are taken for segment-initial and segment-internal phenomena
and for marked and unmarked. A marked focusing operation
begins with a cue phrase, an acknowledgment form or a filled
pause, while an Unmarked operation does not.



Focusing    Number      Mean Pause           Standard
Operation   of Tokens   Duration (seconds)   Deviation

Initiate    23          0.3217               0.2173
Retain      55          0.2091               0.1818
Return      11          0.2545               0.2505
Replace     11          0.6500               0.6727

Table 4: Mean pause durations for each focusing operation.






Speech
Fragment    Initiate    Retain      Return      Replace     ALL

Marked      0.30  13    0.17  18    0.13   6    0.36   7    0.24   44
Unmarked    0.35  10    0.23  37    0.40   5    1.15   4    0.33   56
ALL         0.32  23    0.21  55    0.26  11    0.65  11    0.29  100

Table 6: The mean duration, in seconds, of the pause
preceding focusing operations and marked or unmarked
speech fragments that co-occur. The number of tokens in
the calculation follows the mean value.



operations was 0.33 seconds (standard deviation = 0.36).
Statistically this approaches significance (T(96) = 1.58, p = .12).
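
The comparison of marked and unmarked fragments can be approximated from the rounded summary values: a mean of 0.24 seconds (standard deviation 0.24) over 44 marked tokens and a mean of 0.33 seconds (standard deviation 0.36) over 56 unmarked tokens. The paper does not name the test used. The sketch below assumes an unequal-variance (Welch) t-test, chosen only because its degrees of freedom come out close to the reported 96; the function name is hypothetical.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a   # squared standard error of mean A
    var_b = sd_b ** 2 / n_b   # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# Marked fragments first, unmarked fragments second.
t_value, df = welch_t(0.24, 0.24, 44, 0.33, 0.36, 56)
```

With these rounded inputs the degrees of freedom round to 96 and t is about 1.5, slightly below the reported 1.58. The difference is plausibly due to rounding of the means and standard deviations, but that cannot be confirmed from the published figures.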

10.3. Discussion 

Thus far, analysis of the data identifies significantly longer 
pauses for the Replace operation than for any other and shows 
that pause duration is positively correlated with the number of 
segments affected by one focusing operation. These findings 
begin to distinguish the focusing operations quantitatively, 
by number of focus spaces affected, and qualitatively, by 
whether they occur within an established context (Initiate, 
Retain, Return) or at its beginning (Replace).

Although the raw data in Table 5 appear to show patterns
for specific segment-initial tokens, the number of tokens is
insufficient for establishing a correlation between cue phrase
and focusing operations, let alone a three-way relationship
among cue phrase, pause duration and focusing tasks.

The categorical classifications present particular problems,
for uncertainties arose even with the application of a
classification metric. Perhaps these uncertainties should have been
incorporated into the coding scheme, or perhaps the categorical
classifications should have been abandoned^ in favor of
additional and quantifiable acoustical and lexical features.

10.4. Refining the original hypotheses 

Only partial conclusions can be drawn from the data. How- 
ever, the results are useful toward refining the original hy- 
potheses and determining the content of future research. The 
distinction between the pause data for marked and unmarked 
fragments is a case in point. For each focusing operation, 
the difference between mean pause durations at best only
approaches significance (see Table 6). However, because
the values for all focusing operations are always greater for 
unmarked utterances, a hypothesis is suggested: that, given a 
speech fragment and the focusing operation it evinces, the pre- 
ceding unfilled pause will be longer if the fragment is lexically 
unmarked. 



'at least, in this stage of the investigation 
Initial
Token       Initiate    Retain      Return      Replace     ALL

Unmarked    0.35  10    0.23  37                1.15        0.33
ALL         0.32  23    0.21  55    0.26  11    0.65  11

Table 5: The mean duration, in seconds, of the pause
preceding fragment-initial tokens and focusing operations
that co-occur. The number of tokens in the calculation
follows the mean value.



Pause duration and depth of embedding. A correlation of
pause duration and the depth of embedding in the linguistic
structure (or equivalently, the number of focus spaces still
on the stack) showed no significant effect on pause duration
(F(1,98) = 0.1861, p < .7).
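
To give a sense of how small this effect is, the F statistic can be converted to a proportion of variance explained. The sketch below assumes the test was a simple linear regression of pause duration on depth of embedding, for which r squared equals F / (F + error degrees of freedom); the paper does not describe the test in more detail, and the function name is hypothetical.

```python
def r_squared_from_f(f_value: float, df_error: int) -> float:
    """Variance explained in a one-predictor regression, given F(1, df_error)."""
    return f_value / (f_value + df_error)


r_squared = r_squared_from_f(0.1861, 98)
```

Under that assumption, depth of embedding accounts for roughly 0.2 percent of the variance in pause duration.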

Pause duration, cue phrase and focusing operation. The
directions dialogue contained too few fragment-initial tokens
to calculate meaningful statistics about their relation to
focusing operations. Therefore, the best course was to select
from the raw data (see Table 5) the patterns that were likely
candidates for further testing. For example, So was never
associated with an Initiate operation and also was preceded
by the smallest mean pause durations (0.13 seconds). A filled
pause, with a similar mean pause duration (0.14 seconds), was
primarily associated with Initiates and Retains but never with
Replace. And, while And shared the same focusing operations
as a filled pause, its mean value for pause duration was more
than twice as large (0.33 seconds).

Pause duration and marked/unmarked. To compensate for
the small sample sizes of the cue phrase data, all explicit
lexical markers of structure (cue phrase, acknowledgment, filled
pause) were collapsed into the category marked. The data
in this category were compared to the data for lexically
unmarked fragments. Because the longest pauses preceded
unmarked Returns and Replacements, I predicted that unmarked
operations would in general be preceded by longer pauses
than marked.

The results are in the direction predicted and are
summarized in Table 6. The average duration for pauses preceding a
marked focusing operation was 0.24 seconds (standard
deviation = 0.24), while the average for pauses preceding unmarked



If this hypothesis is correct, two accounts can be constructed
that would jointly predict the appearance of cue phrases. One 
account emphasizes the processes involved in choosing and 
communicating the state of global focus. The other empha- 
sizes the mutually recognized (by speaker and hearers) atten- 
tional and intentional state of the discourse. Together they 
identify the factors that would impel a speaker to precede an 
utterance with a cue phrase, an unfilled pause or both. 

The influence of the speaker's internal processes and
conversational goals. If an unfilled pause preceding a lexically
unmarked fragment is significantly longer, we might assume
that a particular focusing operation is executed in a
characteristic amount of time (given adequate consideration of other
contextual features). Within this time, we might observe
silence, a cue phrase or both.^

Because both pause and cue phrase can appear at the same 
location in a phrase, we ask if their functions are equivalent, 
or instead, complementary. My hypothesis selects the second 
option, that they are complementary in the cognitive process- 
ing each reflects and in the discourse functions each fulfills. 
For, if the duration of an unfilled pause is evidence of the 
difficulty of a cognitive task, a cue phrase is evidence of its 
partial resolution. 

As a communicative device, cue phrases are more cooperative 
than silence. In silence, a listener can only guess at the 
current contents of the speaker's models. With the uttering of 
a cue phrase, the listener is at least notified that the speaker 
is constructing a response. The minimal cue in this regard is 
the filled pause. Bona fide cue phrases, however, herald not
only an upcoming utterance, but a particular direction of focus 
and even a propositional relation between prior and upcoming 
speech. 

Cue phrases serve not only the listener but also the speaker.
Because they commit to topic structure, but not to specific 
referents and discourse entities, they buy additional time for 
the speaker in which to complete a focusing operation and 
formulate the remainder of the utterance. 

The influence of the state of the discourse. The account
of the influence of the currently observable state of the
discourse rests on two patterns in the data: (1) the difference in
pause durations for marked and unmarked Initiates and
Retains is minimal; and (2) the difference between marked and
unmarked Returns and Replaces is greater. If these patterns
can be shown to be significant, they suggest that remaining
in the current context is less costly than returning to a former 
context, or establishing a new one. The corollary is the claim 
that an expected focusing operation need not be marked, while 
an unexpected operation is most felicitous when marked. 



'"The discussion will focus on cue phrases, even though the points are 
relevant to other lexical markers of discourse structure and processing. 



In other words, remaining in the current context or enter- 
ing a subordinate context is expected behavior, while exiting 
the current context is not. Exiting the current context (focus 
space) carries a greater risk of disrupting a mutual view of 
discourse structures. The extent of risk is assessed for the 
listener by the difficulty of tracking the change and for the 
speaker, by the difficulty of executing it. The risk originates 
in the nondeterministic definitions of Return and Replace
operations: both contain in their structure one or more pops.
In addition, these operations can be confused because both 
begin identically, with a series of pops. 

Because closing a focus space is a marked behavior, the clues 
to changing focus are most cooperative if they guide the lis- 
tener toward re-invoking a prior context (i.e., a Return) or 
establishing a new one (Replace). Thus, certain clues are 
more likely to mark a return to a former context (e.g., So,
Anyway, As I was saying), while others (Now, the ordinal
phrases) mark a Replace.

Future work. The goal of future investigations is to establish
the bases for predicting the appearance of particular acoustical 
and lexical features. The speculations presented in this section 
provide a theoretical framework. If borne out, they can be re- 
fashioned as characterizations of the circumstances in which 
cue phrases and unfilled pauses are most likely to be used. 



11. Conclusion 

The relationships among cue phrases, unfilled pauses and the 
structuring of discourse are investigated within the paradigm 
of the tripartite model of discourse. Within this model, the 
postulation of four focusing operations provides an opera- 
tional framework to which can be tied the discourse functions 
of cue phrases and the cognitive activity associated with the 
production of an utterance. In particular, the difficulty of
utterance production might be explained by the complexity of the
co-occurring focusing operation. Such a correspondence is, 
in fact, suggested by the positive correlation of pause dura- 
tion and the number of focus spaces opened or closed in one 
operation on the focus space stack. 

However, because the classification of focusing operations 
is uncertain, more data and better tests are required to char- 
acterize the relationships among the lexical and acoustical 
correlates of topic and focus. In addition, the aptness of the 
tripartite model itself is not assured. The idealizations it con- 
tains may undergo modification in light of new results, or be 
augmented by other accounts of discourse processing. On the 
other hand, the analysis of more quantitative data may confirm 
the implications of the model, and its appropriateness as the 
foundation for investigating the lexical and prosodic features 
of discourse. 